To clean the data, I first converted all rating columns to numeric (for easier rounding and prediction later on).
#str(gs)
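# convert a factor to numeric via its labels (not its integer codes)
asnum = function(x) as.numeric(as.character(x))
# apply asnum to every factor column; drop = FALSE keeps a data frame even if only one column is a factor
facnum = function(y) modifyList(y, lapply(y[, sapply(y, is.factor), drop = FALSE], asnum))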
gs = facnum(gs)
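As a small illustration of why the conversion goes through `as.character()` first, here is a sketch on a made-up factor of ratings (the values are hypothetical, not taken from the dataset). Calling `as.numeric()` directly on a factor returns its internal level codes rather than the numbers printed in the labels.

```r
# A factor stores integer codes plus labels; the ratings live in the labels.
ratings <- factor(c("4.5", "3.0", "5.0", "3.0"))

codes  <- as.numeric(ratings)                # internal level codes, not ratings
values <- as.numeric(as.character(ratings))  # the actual rating values

print(codes)
print(values)
```

The two vectors differ, which is the reason the helper above converts to character before converting to numeric.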
#changing word errors
#ORGA/B/C/D to 'e'
gs[317:326,1] = gsub("ORG[A-Z]","e", gs[317:326,1])
gs[317:326,2] = gsub("ORG[A-Z]","e", gs[317:326,2])
gs[317:326,3] = gsub("ORG[A-Z]","e", gs[317:326,3])
gs[1641,1] = gsub("ORG[A-Z]","a", gs[1641,1])
gs[1641,2] = gsub("ORG[A-Z]","a", gs[1641,2])
#splitting words that run together
gs$pros = gsub("([a-z])([A-Z])", "\\1 \\2", gs$pros)
gs$cons = gsub("([a-z])([A-Z])", "\\1 \\2", gs$cons)
gs$advice = gsub("([a-z])([A-Z])", "\\1 \\2", gs$advice)
#line 374?
#removing punctuations
gs$pros = gsub("[[:punct:]]", " ", gs$pros)
gs$cons = gsub("[[:punct:]]", " ", gs$cons)
gs$advice = gsub("[[:punct:]]", " ", gs$advice)
#removing numbers
gs$pros = removeNumbers(gs$pros)
gs$cons = removeNumbers(gs$cons)
gs$advice = removeNumbers(gs$advice)
#trimming whitespace
gs$pros = str_squish(gs$pros)
gs$cons = str_squish(gs$cons)
gs$advice = str_squish(gs$advice)
#removing rows containing other languages
foreignwords = c('auf','und','ganz','von','gute','zeit','viele','keine','ich','eine','mondiale','klarer','goede','nach', 'esprit')
gs = gs %>%
filter(!grepl(paste(foreignwords, collapse="|"), pros))
gs = gs %>%
mutate(organization = as.factor(organization))
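One thing worth checking with the foreign-word filter is that `grepl()` matches substrings, so short entries such as "ich" or "und" can also hit English words. The sketch below uses made-up review strings and a shortened word list (both hypothetical) to compare the substring pattern with a whole-word variant built from `\\b` word boundaries. It is only an illustration of the difference, not a claim about which rows were dropped in the real data.

```r
# Hypothetical reviews: only the second one is actually German.
reviews <- c("great culture which I enjoyed",
             "gute Kollegen und viele Projekte",
             "fund managers around the office")
foreignwords <- c("und", "ich", "gute", "viele")

# Substring match, as in the filter above: "which" contains "ich",
# "fund" and "around" contain "und".
loose <- grepl(paste(foreignwords, collapse = "|"), reviews)

# Whole-word match: wrap the alternation in word boundaries.
strict_pattern <- paste0("\\b(", paste(foreignwords, collapse = "|"), ")\\b")
strict <- grepl(strict_pattern, reviews)

print(loose)
print(strict)
```

With the whole-word pattern only the German review is flagged, while the substring pattern flags all three.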
Using the NRC lexicon for my analysis.
gs = gs %>%
mutate(id = 1:n()) %>%
select(id, everything())
#get_sentiments("nrc")
nrcWord = textdata::lexicon_nrc()
nrcValues = lexicon::hash_sentiment_nrc
For the foundation of my sentiment analysis, I’ll be splitting the reviews into pros, cons, and advice and assessing sentiment for each. I’ll also add average sentiment values to the main dataset to see whether any trends appear.
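Each of the three blocks below joins the tokens to two lexicons: one with a polarity score per word and one with sentiment categories per word. Because a word can carry several categories, the second join is one-to-many, so a word's score is repeated once per category. The sketch below shows that effect with base R `merge()` and two toy lexicons (hypothetical stand-ins, not the real NRC tables); if I read the joins correctly, the per-review sums and counts computed later inherit the same repetition.

```r
# Hypothetical tokens from two reviews.
tokens <- data.frame(id = c(1, 1, 2), word = c("good", "lack", "good"))

# Toy stand-ins for the two lexicons: one polarity score per word,
# but possibly several sentiment categories per word.
scores <- data.frame(word = c("good", "lack"), y = c(1, -1))
categories <- data.frame(
  word = c("good", "good", "good", "lack"),
  sentiment = c("positive", "joy", "trust", "negative")
)

scored <- merge(tokens, scores, by = "word")        # one row per token
labelled <- merge(scored, categories, by = "word")  # one row per token and category

# Per-review score sums before and after the one-to-many join.
sum_once <- aggregate(y ~ id, data = scored, FUN = sum)
sum_dup  <- aggregate(y ~ id, data = labelled, FUN = sum)

print(sum_once)
print(sum_dup)
```

In this toy case the second join grows three rows into seven, and the per-review sums change accordingly. Summarising scores from a table that has one row per distinct word and review would avoid the repetition, if that is the quantity of interest.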
### PROS
pros = gs %>%
select(id, pros) %>%
unnest_tokens(., word, pros)
pros$word = removeWords(pros$word, words = stopwords("en"))
pros = pros %>%
filter(word != "")
pros = pros %>%
inner_join(nrcValues, by = c("word" = "x"))
pros = pros %>%
inner_join(nrcWord, by = c("word" = "word"))
# pros %>%
# group_by(id) %>%
# summarize(ave = mean(y))
pros = pros %>%
rename("pro_word" = "word",
"pro_score" = "y",
"pro_sent" = "sentiment")
### CONS
cons = gs %>%
select(id, cons) %>%
unnest_tokens(., word, cons)
cons$word = removeWords(cons$word, words = stopwords("en"))
cons = cons %>%
filter(word != "")
cons = cons %>%
inner_join(nrcValues, by = c("word" = "x"))
cons = cons %>%
inner_join(nrcWord, by = c("word" = "word"))
cons = cons %>%
rename("con_word" = "word",
"con_score" = "y",
"con_sent" = "sentiment")
### ADVICE
advice = gs %>%
select(id, advice) %>%
unnest_tokens(., word, advice)
advice$word = removeWords(advice$word, words = stopwords("en"))
advice = advice %>%
filter(word != "")
advice = advice %>%
inner_join(nrcValues, by = c("word" = "x"))
advice = advice %>%
inner_join(nrcWord, by = c("word" = "word"))
advice = advice %>%
rename("adv_word" = "word",
"adv_score" = "y",
"adv_sent" = "sentiment")
For my analysis, I’ll first look at the frequency of sentiments, and of words within those sentiments and review types, to see whether there are any trends in the data (what the major considerations are, which topics are talked about most, and so on). I’ll also add average sentiment values to the main dataset to see whether there are any strong relationships among those variables.
First, let’s get the frequency of sentiments and words within those sentiments in pro reviews.
### Getting Frequency in Pro Reviews
pros %>%
group_by(pro_sent) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
ggplot(aes(reorder(pro_sent, count), count, fill = pro_sent)) +
geom_bar(alpha = 1, show.legend = FALSE, stat = "identity") +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
labs(title = "Frequency per Sentiment in Pro Reviews",
y = "Occurrences",
x = "") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
In pro reviews, the generally positive sentiments (positive, trust, anticipation, joy, surprise) far outweigh the rest, which is what we would expect from these reviews. Let’s see which words are used most for each sentiment.
pros %>%
count(pro_sent, pro_word) %>%
filter(pro_sent %in% c("positive","negative","joy","trust","fear","anger")) %>%
group_by(pro_sent) %>%
top_n(10) %>%
ungroup %>%
mutate(pro_word = reorder(pro_word, n)) %>%
mutate(pro_sent = as.factor(pro_sent)) %>%
ggplot(mapping = aes(pro_word, n, fill = pro_sent)) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "identity") +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
facet_wrap(~pro_sent, scales = "free") +
labs(title = "Frequency of Sentiment/Words in Pro Reviews",
y = "Total Number of Occurrences",
x = "") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
We can see here that “good” makes up most of the occurrences, while “management” comes in a close second, indicating that management (among other frequently mentioned factors such as pay and culture) could be a key consideration when employees assess a workplace.
`%notin%` = Negate(`%in%`) #this should really be its own function
bigrams_pro = gs %>%
select(pros) %>%
unnest_tokens(., ngrams, pros, token = "ngrams", n = 2) %>%
tidyr::separate(ngrams, c("word1","word2"), sep = "\\s") %>%
count(word1, word2, sort = TRUE) %>%
filter(word1 %notin% stopwords("en")) %>%
filter(word2 %notin% stopwords("en"))
datatable(bigrams_pro, options = list(pageLength = 10)) %>%
formatStyle('n', color = "black", backgroundColor = "#0CAA41",fontWeight = "bold") %>%
formatStyle(c(' ','word1','word2'),backgroundColor = "black")
trigrams_pro = gs %>%
select(pros) %>%
unnest_tokens(., ngrams, pros, token = "ngrams", n = 3) %>%
tidyr::separate(ngrams, c("word1","word2","word3"), sep = "\\s") %>%
count(word1, word2, word3, sort = TRUE) %>%
filter(word1 %notin% stopwords("en")) %>%
filter(word2 %notin% stopwords("en")) %>%
filter(word3 %notin% stopwords("en"))
datatable(trigrams_pro, options = list(pageLength = 10)) %>%
formatStyle('n', color = "black", backgroundColor = "#0CAA41",fontWeight = "bold") %>%
formatStyle(c(' ','word1','word2','word3'),backgroundColor = "black")
Let’s see what the trends are in con reviews.
cons %>%
group_by(con_sent) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
ggplot(aes(reorder(con_sent, count), count, fill = con_sent)) +
geom_bar(alpha = 1, show.legend = FALSE, stat = "identity") +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
labs(title = "Frequency per Sentiment in Con Reviews",
y = "Occurrences",
x = "") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
Most word occurrences still fall under the “positive” sentiment, with a mix of positive and negative sentiments further down the ranking. Let’s see which words occur most for the relevant sentiments.
cons %>%
count(con_sent, con_word) %>%
filter(con_sent %in% c("negative","trust","sadness","fear","anger","disgust")) %>%
group_by(con_sent) %>%
top_n(10) %>%
ungroup %>%
mutate(con_word = reorder(con_word, n)) %>%
mutate(con_sent = as.factor(con_sent)) %>%
ggplot(mapping = aes(con_word, n, fill = con_sent)) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "identity") +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
facet_wrap(~con_sent, scales = "free") +
labs(title = "Frequency of Sentiment/Words in Con Reviews",
y = "Total Number of Occurrences",
x = "") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
Again, “management” is one of the most talked-about topics in the reviews, with words like “limited” and “lack” (which connote something not meeting expectations) also appearing frequently. “Terrible” and “horrible” show up often in these lists as well, which gives a sense of the degree of negativity in these reviews.
Let’s look at which n-grams appear the most.
bigrams_con = gs %>%
select(cons) %>%
unnest_tokens(., ngrams, cons, token = "ngrams", n = 2) %>%
tidyr::separate(ngrams, c("word1","word2"), sep = "\\s") %>%
count(word1, word2, sort = TRUE) %>%
filter(word1 %notin% stopwords("en")) %>%
filter(word2 %notin% stopwords("en")) %>%
filter(word1 != "t") %>%
filter(word2 != "t")
datatable(bigrams_con, options = list(pageLength = 10)) %>%
formatStyle('n', color = "black", backgroundColor = "#0CAA41",fontWeight = "bold") %>%
formatStyle(c(' ','word1','word2'),backgroundColor = "black")
trigrams_con = gs %>%
select(cons) %>%
unnest_tokens(., ngrams, cons, token = "ngrams", n = 3) %>%
tidyr::separate(ngrams, c("word1","word2","word3"), sep = "\\s") %>%
count(word1, word2, word3, sort = TRUE) %>%
filter(word1 %notin% stopwords("en")) %>%
filter(word2 %notin% stopwords("en")) %>%
filter(word3 %notin% stopwords("en")) %>%
filter(word1 != "t") %>%
filter(word2 != "t") %>%
filter(word3 != "t")
datatable(trigrams_con, options = list(pageLength = 10)) %>%
formatStyle('n', color = "black", backgroundColor = "#0CAA41",fontWeight = "bold") %>%
formatStyle(c(' ','word1','word2','word3'),backgroundColor = "black")
Many reviews again include the terms “work-life balance” (by an overwhelming margin) and “senior/upper management”. The reviews also now mention “long working hours”, “old boys club”, and “extremely high turnover”, which raises concerns about job security and the treatment of certain subgroups within companies. Again, much of the attention falls on work-life balance and management decisions (on HR, opportunities within the company, and overall working hours).
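The n-gram tables in this analysis all follow the same recipe: form the n-grams first, then drop any n-gram that contains a stop word, then count. Here is a minimal base R sketch of that recipe for bigrams, using two made-up sentences and a tiny stop-word list (both hypothetical stand-ins for the review text and `stopwords("en")`); `bigrams_of` is an illustrative helper, not part of any package.

```r
# Hypothetical cleaned review text (lowercase, punctuation already removed).
text <- c("poor work life balance", "no work life balance and long hours")
stop_words <- c("no", "and")  # tiny stand-in for stopwords("en")

# Split one string into consecutive word pairs.
bigrams_of <- function(x) {
  w <- strsplit(x, "\\s+")[[1]]
  if (length(w) < 2) return(NULL)
  data.frame(word1 = head(w, -1), word2 = tail(w, -1))
}

bigrams <- do.call(rbind, lapply(text, bigrams_of))

# Drop a bigram if either word is a stop word, after the n-grams are formed.
has_stop <- bigrams$word1 %in% stop_words | bigrams$word2 %in% stop_words
bigrams <- bigrams[!has_stop, ]

# Count each remaining bigram and sort by frequency.
counts <- as.data.frame(table(paste(bigrams$word1, bigrams$word2)),
                        stringsAsFactors = FALSE)
counts <- counts[order(-counts$Freq), ]
print(counts)
```

Filtering after the n-grams are formed means a phrase like "balance and long" never collapses into "balance long"; it is simply dropped. The extra `"t"` and `"s"` filters in the con and advice tables presumably exist because replacing punctuation with spaces splits contractions such as "don't" into "don" and "t".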
Lastly, let’s see what the trends are for advice.
advice %>%
group_by(adv_sent) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
ggplot(aes(reorder(adv_sent, count), count, fill = adv_sent)) +
geom_bar(alpha = 1, show.legend = FALSE, stat = "identity") +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
labs(title = "Frequency per Sentiment in Company Advice",
y = "Occurrences",
x = "") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
The results for advice are more similar to pro reviews than to con reviews. Again, a lot of emphasis is placed on trust and anticipation, which is to be expected for advice.
Taking a look at word distribution for relevant sentiments…
advice %>%
count(adv_sent, adv_word) %>%
filter(adv_sent %in% c("positive","trust","anticipation","joy","negative","anger","sadness", "fear")) %>%
group_by(adv_sent) %>%
top_n(10) %>%
ungroup %>%
mutate(adv_word = reorder(adv_word, n)) %>%
mutate(adv_sent = as.factor(adv_sent)) %>%
ggplot(mapping = aes(adv_word, n, fill = adv_sent)) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "identity") +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
facet_wrap(~adv_sent, scales = "free") +
labs(title = "Frequency of Sentiment/Words in Company Advice",
y = "Total Number of Occurrences",
x = "") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
Management and pay/benefits seem to be common topics in the advice, as is “leave”. This doesn’t tell us what exactly these comments are about, so let’s look at n-grams.
bigrams_adv = gs %>%
select(advice) %>%
na.omit() %>%
unnest_tokens(., ngrams, advice, token = "ngrams", n = 2) %>%
tidyr::separate(ngrams, c("word1","word2"), sep = "\\s") %>%
count(word1, word2, sort = TRUE) %>%
filter(word1 %notin% stopwords("en")) %>%
filter(word2 %notin% stopwords("en")) %>%
filter(word1 != "t") %>%
filter(word2 != "t") %>%
filter(word2 != "s")
datatable(bigrams_adv, options = list(pageLength = 10)) %>%
formatStyle('n', color = "black", backgroundColor = "#0CAA41",fontWeight = "bold") %>%
formatStyle(c(' ','word1','word2'),backgroundColor = "black")
trigrams_adv = gs %>%
select(advice) %>%
na.omit() %>%
unnest_tokens(., ngrams, advice, token = "ngrams", n = 3) %>%
tidyr::separate(ngrams, c("word1","word2","word3"), sep = "\\s") %>%
count(word1, word2, word3, sort = TRUE) %>%
filter(word1 %notin% stopwords("en")) %>%
filter(word2 %notin% stopwords("en")) %>%
filter(word3 %notin% stopwords("en")) %>%
filter(word1 != "t") %>%
filter(word2 != "t") %>%
filter(word3 != "t")
datatable(trigrams_adv, options = list(pageLength = 10)) %>%
formatStyle('n', color = "black", backgroundColor = "#0CAA41",fontWeight = "bold") %>%
formatStyle(c(' ','word1','word2','word3'),backgroundColor = "black")
In terms of bigrams, the major considerations remain mostly the same, especially work-life balance. In terms of trigrams, the situation is very similar. Work-life balance is mentioned more often, along with other topics like “long-term strategy/longer-term view”, “performance review process”, and “employee engagement programs”. Beyond work-life balance, attention also seems to go to long-term strategy (which could tie in with the earlier observation that high turnover is mentioned a lot) and employee engagement/welfare.
Let’s look for any significant relationships between the ratings and any of the text features, and see if there are any patterns in these relationships.
proSummary = pros %>%
group_by(id) %>%
summarize(sumPro = sum(pro_score),
countPro = n(),
posPro = sum(pro_sent == "positive"),
negPro = sum(pro_sent == "negative"))
conSummary = cons %>%
group_by(id) %>%
summarize(sumCon = sum(con_score),
countCon = n(),
posCon = sum(con_sent == "positive"),
negCon = sum(con_sent == "negative"))
advSummary = advice %>%
group_by(id) %>%
summarize(sumAdv = sum(adv_score),
countAdv = n(),
posAdv = sum(adv_sent == "positive"),
negAdv = sum(adv_sent == "negative"))
fullSumm = proSummary %>%
full_join(conSummary, by = c("id" = "id")) %>%
full_join(advSummary, by = c("id" = "id")) %>%
arrange(id)
fullSumm = fullSumm %>%
mutate(sum_sent = rowSums(subset(fullSumm, select = c(sumPro, sumCon, sumAdv)), na.rm = TRUE),
sum_count = rowSums(subset(fullSumm, select = c(countPro, countCon, countAdv)), na.rm = TRUE),
ave_pro = sumPro / countPro,
ave_con = sumCon / countCon,
ave_adv = sumAdv / countAdv,
ave_sent = sum_sent/sum_count,
sum_pos = rowSums(subset(fullSumm, select = c(posPro, posCon, posAdv)), na.rm = TRUE),
sum_neg = rowSums(subset(fullSumm, select = c(negPro, negCon, negAdv)), na.rm = TRUE)) %>%
select(id, ave_pro, ave_con, ave_adv, ave_sent, sum_pos, sum_neg)
gs = gs %>%
full_join(fullSumm, by = c("id" = "id"))
library(sjPlot)
rating_lm = lm(rating ~ ave_pro + ave_con + ave_adv + ave_sent + sum_pos + sum_neg, data = gs)
wlr_lm = lm(workLifeRating ~ ave_pro + ave_con + ave_adv + ave_sent + sum_pos + sum_neg, data = gs)
cvr_lm = lm(cultureValueRating ~ ave_pro + ave_con + ave_adv + ave_sent + sum_pos + sum_neg, data = gs)
cor_lm = lm(careerOpportunityRating ~ ave_pro + ave_con + ave_adv + ave_sent + sum_pos + sum_neg, data = gs)
cbr_lm = lm(compBenefitsRating ~ ave_pro + ave_con + ave_adv + ave_sent + sum_pos + sum_neg, data = gs)
mr_lm = lm(managementRating ~ ave_pro + ave_con + ave_adv + ave_sent + sum_pos + sum_neg, data = gs)
# summary(rating_lm)
# summary(wlr_lm)
# summary(cvr_lm)
# summary(cor_lm)
# summary(cbr_lm)
# summary(mr_lm)
#
# knitr::kable(tidy(rating_lm), caption = "Ratings")
tab_model(rating_lm, wlr_lm, cvr_lm, cor_lm, cbr_lm, mr_lm,
CSS = list(
css.firsttablecol = 'font-weight: bold'))

| | rating | | | work Life Rating | | | culture Value Rating | | | career Opportunity Rating | | | comp Benefits Rating | | | management Rating | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 3.02 | 2.66 – 3.37 | <0.001 | 3.02 | 2.64 – 3.41 | <0.001 | 2.99 | 2.57 – 3.41 | <0.001 | 3.05 | 2.65 – 3.44 | <0.001 | 2.79 | 2.41 – 3.17 | <0.001 | 2.78 | 2.34 – 3.21 | <0.001 |
| ave_pro | -0.25 | -0.54 – 0.05 | 0.109 | -0.13 | -0.44 – 0.18 | 0.409 | -0.22 | -0.56 – 0.13 | 0.218 | -0.09 | -0.41 – 0.24 | 0.606 | 0.10 | -0.21 – 0.41 | 0.533 | -0.17 | -0.54 – 0.20 | 0.359 |
| ave_con | -0.47 | -0.70 – -0.24 | <0.001 | -0.27 | -0.51 – -0.03 | 0.028 | -0.52 | -0.79 – -0.26 | <0.001 | -0.45 | -0.70 – -0.20 | <0.001 | -0.52 | -0.76 – -0.29 | <0.001 | -0.44 | -0.72 – -0.16 | 0.002 |
| ave_adv | -0.32 | -0.57 – -0.08 | 0.009 | 0.00 | -0.25 – 0.26 | 0.973 | -0.20 | -0.48 – 0.08 | 0.155 | -0.22 | -0.48 – 0.04 | 0.095 | -0.17 | -0.42 – 0.07 | 0.169 | -0.04 | -0.33 – 0.25 | 0.771 |
| ave_sent | 1.53 | 0.90 – 2.16 | <0.001 | 0.68 | 0.03 – 1.34 | 0.040 | 1.29 | 0.58 – 2.01 | <0.001 | 0.86 | 0.18 – 1.53 | 0.013 | 0.95 | 0.30 – 1.59 | 0.004 | 0.95 | 0.18 – 1.72 | 0.015 |
| sum_pos | 0.01 | -0.01 – 0.04 | 0.423 | 0.01 | -0.01 – 0.04 | 0.304 | 0.01 | -0.01 – 0.04 | 0.342 | 0.02 | -0.00 – 0.05 | 0.073 | 0.01 | -0.02 – 0.03 | 0.528 | 0.01 | -0.02 – 0.04 | 0.485 |
| sum_neg | -0.21 | -0.27 – -0.14 | <0.001 | -0.17 | -0.24 – -0.09 | <0.001 | -0.20 | -0.28 – -0.12 | <0.001 | -0.21 | -0.29 – -0.14 | <0.001 | -0.13 | -0.21 – -0.06 | <0.001 | -0.21 | -0.29 – -0.13 | <0.001 |
| Observations | 585 | | | 559 | | | 557 | | | 557 | | | 555 | | | 472 | | |
| R² / R² adjusted | 0.251 / 0.243 | | | 0.138 / 0.129 | | | 0.180 / 0.171 | | | 0.162 / 0.153 | | | 0.131 / 0.121 | | | 0.197 / 0.186 | | |
We can see a clear trend across these models. The text features that are (mostly) significant for each of the ratings are Average Sentiment for Cons, Average Sentiment Overall, and Sum of Negative Sentiment Words. This could matter for the companies, since more negative words in a review come with a lower rating of every kind. (A higher average sentiment in the cons section does not necessarily mean the cons are positive, since it is still a cons section; it could mean that more descriptive words are used and more detail is put into describing the cons of the company.)
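One caveat when reading these coefficients: given how the summary variables are built, the overall average sentiment is a count-weighted mean of the three section averages, so the four average predictors overlap. The sketch below checks that identity on made-up per-review sums and counts (hypothetical numbers, with column names borrowed from the summary tables). It is meant as a reading aid, not as a re-analysis.

```r
# Hypothetical per-review summaries (sums and counts of matched word scores).
s <- data.frame(sumPro = c(4, 2),  countPro = c(5, 4),
                sumCon = c(-3, 1), countCon = c(6, 2),
                sumAdv = c(1, 0),  countAdv = c(2, 1))

ave_pro <- s$sumPro / s$countPro
ave_con <- s$sumCon / s$countCon
ave_adv <- s$sumAdv / s$countAdv

total <- s$countPro + s$countCon + s$countAdv
ave_sent <- (s$sumPro + s$sumCon + s$sumAdv) / total

# The same number, rebuilt as a count-weighted mean of the three averages.
weighted <- (ave_pro * s$countPro + ave_con * s$countCon +
             ave_adv * s$countAdv) / total

print(cbind(ave_sent, weighted))
```

Because of this overlap, each section coefficient is an effect with the overall average held fixed, which may be why the section averages come out negative while the overall average is strongly positive.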
Let’s visualize some of these relationships (not all of them, just enough to make the point).
gs %>%
ggplot(aes(x = ave_con, y = rating, col = organization)) +
geom_point(show.legend = FALSE) +
geom_smooth(method = "lm", col = "#0CAA41") +
facet_wrap(~organization) +
labs(title = "Relationship between Overall Rating and Average Sentiment for Cons",
y = "Overall Rating",
x = "Average Sentiment for Cons") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
axis.title = element_text(face = "bold",
color = "#0CAA41"),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
gs %>%
ggplot(aes(x = ave_sent, y = rating, col = organization)) +
geom_point(show.legend = FALSE) +
geom_smooth(method = "lm", col = "#0CAA41") +
facet_wrap(~organization) +
labs(title = "Relationship between Overall Rating and Average Overall Sentiment",
y = "Overall Rating",
x = "Average Sentiment") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
axis.title = element_text(face = "bold",
color = "#0CAA41"),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
gs %>%
ggplot(aes(x = sum_neg, y = rating, col = organization)) +
geom_point(show.legend = FALSE) +
geom_smooth(method = "lm", col = "#0CAA41") +
facet_wrap(~organization) +
labs(title = "Relationship between Overall Rating and Total Negative Sentiment Occurrences",
y = "Overall Rating",
x = "Occurrences of Negative Sentiment Words") +
theme(text = element_text(family = "Avenir Next",
color = 'white'),
plot.title = element_text(color = "#0CAA41",
face = "bold"),
axis.text = element_text(color = 'white'),
axis.title = element_text(face = "bold",
color = "#0CAA41"),
plot.background = element_rect(fill = 'black'),
panel.background = element_rect(fill = 'black'))
From these visualizations, we don’t see much of an effect as Average Sentiment for Cons increases (in contrast to the regression summaries), but we do see clearer effects as Average Overall Sentiment and Occurrences of Negative Sentiment Words increase. This makes some sense, as relatively negative comments usually affect a person’s point of view more than relatively positive ones. People do like seeing the strengths of companies and people, but they hate seeing flaws even more.
# Fixing Dataset
library(stm)
gstm = gs %>%
mutate(fulltext = paste(pros, cons, advice, sep = " ")) %>%
select(id, organization, fulltext, rating, workLifeRating, cultureValueRating, careerOpportunityRating, compBenefitsRating, managementRating)
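# paste() turns missing sections into the literal string "NA"; strip it out
# (this pattern also removes "NA" inside upper-case words; "\\bNA\\b" would match only the stand-alone token)
gstm$fulltext = str_remove_all(gstm$fulltext, "NA")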
set.seed(1001)
holdoutRows = sample(1:nrow(gstm), 100, replace = FALSE)
reviewText = textProcessor(documents = gstm$fulltext[-c(holdoutRows)],
metadata = gstm[-c(holdoutRows), ],
stem = FALSE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
reviewPrep = prepDocuments(documents = reviewText$documents,
vocab = reviewText$vocab,
meta = reviewText$meta)
## Removing 3356 of 6729 terms (3356 of 44826 tokens) due to frequency
## Your corpus now has 1470 documents, 3373 terms and 41470 tokens.
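The log above reports 3356 terms removed along with exactly 3356 tokens, which suggests the dropped terms each occurred only once. As a rough sketch of that kind of frequency pruning, the base R snippet below drops terms that appear in only one document. The documents are made up, and the threshold of one document is an assumption based on my understanding of the `prepDocuments` default (`lower.thresh = 1`), so treat it as an illustration rather than a description of the package internals.

```r
# Hypothetical tokenized documents.
docs <- list(c("good", "pay", "team"),
             c("good", "management", "pay"),
             c("long", "hours", "good"))

# Number of documents each term appears in.
doc_freq <- table(unlist(lapply(docs, unique)))

# Keep terms found in more than one document (assumed threshold of 1).
keep <- names(doc_freq)[doc_freq > 1]
pruned <- lapply(docs, function(d) d[d %in% keep])

print(keep)
print(pruned)
```

Rare terms contribute little to topic estimation, so pruning them shrinks the vocabulary by about half here while removing only a small share of the tokens.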
load("/Volumes/GoogleDrive/My Drive/Laptop Files/1920B1/MW 1 Unstructured Data Analytics/02 Assignments/Homework 2/Homework 2/kOut.RData")
kTest = kTest
plot(kTest)
It looks like 4 topics is the ideal number, since that is the point where the diagnostic values drop off significantly (though there are some fluctuations in semantic coherence).
Time to determine our topic models.
topics4 = stm(documents = reviewPrep$documents,
vocab = reviewPrep$vocab, seed = 1001,
K = 4, verbose = FALSE)
plot(topics4)
## Topic 1 Top Words:
## Highest Prob: people, will, don, like, just, take, know
## FREX: will, know, money, bad, months, anything, owner
## Lift: abusive, accidental, accommodation, accomodation, agents, agreed, allegations
## Score: will, don, money, know, pros, bad, people
## Topic 2 Top Words:
## Highest Prob: projects, people, culture, clients, consulting, firm, client
## FREX: projects, consulting, firm, consultants, compensation, interesting, firms
## Lift: acceptable, accordingly, accounts, achievement, adam, adopted, agile
## Score: projects, consulting, firm, clients, consultants, senior, client
## Topic 3 Top Words:
## Highest Prob: management, staff, team, training, new, company, small
## FREX: staff, training, leadership, support, orgb, communication, corporate
## Lift: according, accountable, acted, active, activity, agenda, alabama
## Score: staff, leadership, management, support, training, orgb, team
## Topic 4 Top Words:
## Highest Prob: work, good, company, great, employees, can, time
## FREX: pay, life, place, hard, best, none, way
## Lift: accounting, advising, annoying, appraisal, besides, boring, ceos
## Score: work, company, good, pay, employees, time, place
After scanning through the different topics, I’d say they’re fairly diverse: topic 1 seems to talk a lot about the cons of the company (rather broadly), topic 2 revolves around consulting firms and the corporate ladder, topic 3 talks mostly about the work environment, and topic 4 seems to be about benefits.
##
## Topic 1:
## There are no Pros in working for this company One day was all that was necessary to see a descending career Everthing I read about this company was and is true Allegations charges complaints usery etc IT IS TRUE Do not waste your time if what you re reading is telling you not to proceed I took the week long training online Received my so called leads on Saturday All lead cards read with positive messages such as he is excited you re local he s anticipating the meeting they are open to the discussion they d love to take advantage of what we have to offer etc I drove miles to the appoinments and back home that Monday Of the leads I saw He said he didn t get a call and is already sserviced by someone else The other were NO GOOD Like the others said you can complain to management but they ll tell you hang in there You must have done something wrong The same every no is closer to a yes Except in this setup every no is closer to you being broke making NO and chased off the customers premises I ended the relationship that Monday evening IT S NOT WORTH YOUR TIME If I were allowed to use my own network at least I d get in the door However once my network were to research this company I d no longer have access to that network I m glad i didn t expose my network to ORGC Do not pass GO Do not collect anymore fees Shut your doors immediately
## No Pros at all don t go for it Completely a fraud Organization just know how to abduct money from innocent people said they are giving us free training online after one year they are harassing us by collections Agency and Lawyers they are calling repeatedly for Don t go for this Organization no one is picking up the phone when I am calling them very cheap dirty no morals organization
## Pros there are none when you realize that no matter what you will not get paid The cons are what you will read throughout this post Tonya Mcmillian her mother Doctor Houston and her main minion RJ Lewis are liars thieves and shysters They hire you as a contractor so that you have no rights under the department of labor They tell you that you ll be placed months in the hole so you have to invest your own personal money buying gas food for clients etc Once you re months in a different owed thousands of dollars you have to quit They then hire a new group of people and start over again She is running businesses and living a lavish lifestyle plus taking expensive trips while people are going under losing their homes and vehicles Your actions are irresponsible and criminal This is theft by deception at the very least and once enough people are together on this you re going to jail
## The only pros are the para pros and the therapists who naively come in to help the community This agency is not here to help others They are only in business to help themselves Their agency is riddled with corruption and disorganized The turnover rate of their employees is literally a revolving door of therapists with degrees who are worked to the point that they become the population that they are trying to serve Come to this agency if you just want to tell people that you have a job But come pay day they will find a reason for your check to be short They are always full of excuses This agency lives check to check If you work for this agency you will be in an never ending cycle of debt Evictions will happen to you because your landlord will not understand how you went to work for a company DAY IN AND DAY OUT and you still don t have the money to pay your rent You will waste all of your savings money gas on going to go see their clients but you will not realize until you are at rock bottom that even though you put the client first PMC does not care about you or your bills It makes ABSOLUTELY no sense to work for an agency that mismanages their funds and push back employees paychecks And lastly don t ever work for a company that tells you this All the clients that you see in January you will get paid in February Get your check every two weeks or don t work for them at all No matter how many times you switch offices company names or refuse to answer the phones everything that you have done to your current and former employees will follow you forever Your reputation is your wealth and you are looking mighty impoverished at this point
## Topic 2:
## The firm is THE place for professionals focused on gaining superior experience and a wide array of unique client engagement generally unseen even in best of global consulting firms The firms thrives by its ability to innovate each day re shaping its client offering to match the market needs and delivers solutions absolutely contemporary to the client There is no second SATHGURU in innovation Advisory and growth development consulting The firm is very low key and does not blow the trumpet for all the excellent international engagements it has performed and continues to perform Need to break this humility and talk about the accomplishments and the true strength of the firm Expand this opportunity and unique strengths several fold into more arena in the world
## ORGD engagements are almost exclusively state and local government oriented focused on improving citizen services with information technology Projects are relatively short and high impact The firm helps you develop your own consulting practice with its tools resourses and framework Projects and travel are intense The firm s reputation is built on very high levels of expertise quality and performance Don t compromise quality and consistency
## High degree of client exposure Rapid ramp up to responsibilities Great people International presence Prestigious client portfolio Lack of international exposure in some practices
## Smart but down to earth people collaborative and relatively flat work culture high quality high impact client work on the rise growing rapidly Operational sometimes tactical work Narrow expertise and skill set the firm is actively working on changing this Limited name recognition Average compensation relative to larger consulting firms Work culture becoming more competitive Becoming more of an up or out firm Focus on more strategic work Diversify capabilities and skill set Invest in improving brand awareness Strive to maintain a collaborative work culture
## Topic 3:
## Clean offices top notch technology conveniently located for where I lived at the time Non executive colleagues were professional and supported each other well Executive staff is personable Lack of leadership from the top executives results in dysfunctional workplace environment Company has no strategic plan no business development plans and no structure for growth or sustainability Hiring and firing occurs at whiplash speed with no explanation given to staff Continual restructuring and resistributing of executive responsibilities leads to confusion support staff is constantly having to report to different managers at different times For a management and training company they have a remarkably unenlightened view on training their own staff Leadership places new hires and personnel in management positions with no training or relevant experience related to the position Virtually every employee has at one time or another been handed the baton for business development and not a single one of them had has any background or experience in that area Support staff is often given the responsibility for securing new business managing and writing proposals and structuring management solutions without the necessary authority or experience Staff is often expected to burden repsonsibilities far above their paygrade picking up the slack left by the executive team Support and non supervisory staff do have experience with proposal writing proposal management contract management budget management and pricing However executive staff disregards this experience and continues to employ the same losing strategies upon which they have always relied Company mantra is if it hasn t worked so far just keep hitting your head against that wall until it caves Ridiculous amount of administration required to complete simple tasks COO s coordinator is working out of Alabama so all tasks have to be routed through him her before getting to anyone else in the office All staff responses also have to 
be routed to Alabama before proceeding Proposals are monitored out of Alabama while all the actual work writing managing editing researching is completed in Dumfries by the non management staff Still all pieces of the proposals are sent to Alabama and then forwarded to your colleague across the hall for changes edits etc Get some training for your staff Attend management and executive level training courses yourselves Hire a CEO who lives in the area and who has some management experience outside the military Hire a COO who stays in the office beyond pm or better yet one that actually works in the office rather than from home Eliminate the added layer of unnecessary administration so tasks can be completed more efficiently When you do hire experienced people listen to them don t fire them because they don t agree with your worldview
## geld wird gestaffelt erhöht wenn man druck ausübt zeitmanagment ist sehr wichtig für alle zeit gehalt ist zu wenig klima ist dagegen sehr gut managment ist fleißig neue bewerber besser über aufgaben informieren damit sie sich besser vorbereiten können so wird einarbeitung effektiver
## They attempt to appeal to employees with seasonal gifts such as turkeys ornaments with the company logo etc Hired teaching and support staff very friendly Provides lowest teacher salaries in the surrounding area benefits provided not comparable to benefits provided to teachers in public schools students transferring to other schools find themselves far behind their peers the school does not provide the resources financial and classroom materials necessary to give all attending students the high quality education they deserve the administration does not promote a healthy work environment the administration acts aggressively towards employees they are not personal friends with no opportunities for advancement Focus on the financial emotional and resource needs of the highly trained and experienced teaching staff before the needs of parents HR and administration The success and retention rate of students has a direct correlation to the success and experiences of the teachers
## The paychecks were always deposited on time There was no possibility of advancement This may not be the case at all locations but in the area I worked in the local branch management was laughably incompetent and was engaged in some activity of questionable ethics and legality However with no possible means provided to employees of reporting activities or lack thereof to senior management he would have been there forever had the military contract not expired Ensure local branch employees have a means of open communication to a regional or senior manager This could have prevented several of the problems that occurred at some of the remote stations
## Topic 4:
## Friendly working environment good pay part time student job easy to learn opportunity for promotion great boss great leads not too hard It is constantly full of people and noise if you work day shifts and if you work night shifts they go on till preety late Show leadership by example not by authority
## You are able to work from home They have a great HR department Nice annual trip The interims are great to work with They expect you to work hours a day and get paid for Employee benefits are high Negative work environment Learn how to treat your employees better
## Learn from your mistake that started when you joined here Not organized No inspiration from superiors Not much choice flexibility In short not a place to work Sell the place while you still can
## Highly autonomous and no yearly reviews means you can do anything you want and have no reprecussions great employer for those looking to slack off and not move up All employees are hourly employees Overtime pay time and a half however employees are often banned from submitting overtime hours during financial cut backs Management sexually harassed multiple female employees for years employees reached out to higher ups asking for HR contacts were told there were no HR representatives on staff and there was nothing they could do about it This was a lie employees later found out there are HR representatives on staff Took years of abuse to rid the company of the harrassers There is no employee handbook so corporate office is able to make up rules at will Examples of made up rules include unscheduled leave during government office OMB closures during snowmeggedon in DC when the government was closed for days in and contractors were unable to work ORGC forced employees to use their vacation days If an employee did not have enough vacation days they were forced to make it up during overtime Employees earn vacation days extremely slowly it takes a full year of work to earn vacation days There are no sick days you must use your vacation days Weird accounting system Employees MUST work a full hr day each week even if you work hr days during the rest of the week and amass the normal hr workweek but worked a hr day you must use hr of vacation time Get an employee handbook Manage your staff Cut the cancer
If we look at the reviews most strongly associated with each topic, they differ somewhat from my initial interpretation. Topic 1 contains many reviews that state negative perceptions and then immediately offset them by building up the company. Topic 2 is relatively in line with my initial analysis. Topic 3 does elaborate on the work environment, but on the negative side, narrating experiences of dysfunctional workplaces. Topic 4 does talk about benefits, but primarily about the lack of benefits and other cons of the company. Positive and negative content is evenly split in these samples, but the negative reviews are longer and more emphatic.
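To make "reviews highly associated with each topic" concrete, here is a minimal base R sketch of the underlying idea: rank documents by their estimated proportion for each topic and keep the top few. The `theta` matrix and review names below are hypothetical toy values, not output from the fitted model; in stm, `findThoughts()` performs this kind of lookup on a fitted model.

```r
# Hypothetical document-topic proportions (rows = reviews, columns = topics).
# Each row sums to 1, as topic proportions do.
theta <- matrix(
  c(0.70, 0.10, 0.10, 0.10,
    0.05, 0.15, 0.60, 0.20,
    0.10, 0.65, 0.15, 0.10,
    0.20, 0.10, 0.10, 0.60,
    0.10, 0.20, 0.55, 0.15),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("review", 1:5), paste0("Topic", 1:4))
)

top_n <- 2

# For each topic (column), sort reviews by proportion and keep the top_n names
top_reviews <- apply(theta, 2, function(p) {
  names(sort(p, decreasing = TRUE))[1:top_n]
})

print(top_reviews)
```

Reading only the top-ranked documents is a quick sanity check on topic labels, but it shows each topic at its most extreme, so it may overstate how negative or positive a typical review in that topic is.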
newReviewText = textProcessor(documents = gstm$fulltext[holdoutRows],
                              metadata = gstm[holdoutRows, ],
                              stem = FALSE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
## Your new corpus now has 100 documents, 1088 non-zero terms of 1356 total terms in the original set.
## 268 terms from the new data did not match.
## This means the new data contained 32.3% of the old terms
## and the old data contained 80.2% of the unique terms in the new data.
## You have retained 2991 tokens of the 3274 tokens you started with (91.4%).
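The percentages in this alignment message can be reproduced from the counts it reports. The sketch below is only arithmetic on the printed figures; reading 1356 as the number of unique terms in the holdout text is an assumption, chosen because it makes the numbers consistent with one another.

```r
# Figures copied from the alignment message above
terms_matched   <- 1088  # holdout terms also present in the training vocabulary
terms_new_total <- 1356  # assumed: unique terms in the holdout text
tokens_kept     <- 2991
tokens_start    <- 3274

# Terms in the holdout text with no match in the training vocabulary
terms_unmatched <- terms_new_total - terms_matched

# Share of the holdout vocabulary that the training vocabulary covers
pct_new_terms_covered <- 100 * terms_matched / terms_new_total

# Share of holdout tokens that survive alignment
pct_tokens_kept <- 100 * tokens_kept / tokens_start

# Training vocabulary size implied by "32.3% of the old terms"
# (approximate, because the printed percentage is rounded)
implied_old_vocab <- terms_matched / 0.323

cat(terms_unmatched, round(pct_new_terms_covered, 1),
    round(pct_tokens_kept, 1), round(implied_old_vocab), "\n")
```

The practical takeaway is that about a fifth of the holdout terms are dropped because the model never saw them, yet over 90% of the tokens remain, so the unmatched terms are mostly rare words.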
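# newReviewCorp: the holdout corpus after aligning its vocabulary to the
# training vocabulary (that step is not shown above; only its output is)
newReviewFitted = fitNewDocuments(model = topics4, documents = newReviewCorp$documents,
                                  newData = newReviewCorp$meta, origData = reviewPrep$meta)
## ....................................................................................................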
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
reviewPrep2 = prepDocuments(documents = predictorText$documents,
                            vocab = predictorText$vocab,
                            meta = predictorText$meta)
## Removing 3442 of 6907 terms (3442 of 47718 tokens) due to frequency
## Your corpus now has 1570 documents, 3465 terms and 44276 tokens.
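The "due to frequency" message reflects pruning of rare terms. As a rough illustration, the base R sketch below drops every term that appears in no more than `lower_thresh` documents, using made-up toy documents, and then checks the arithmetic of the counts reported above. The toy data and variable names are hypothetical, and a threshold of 1 is an assumption (it is the `prepDocuments()` default, and the call above does not override it).

```r
# Toy tokenized documents (hypothetical)
docs <- list(
  c("great", "pay", "great", "team"),
  c("poor", "management", "pay"),
  c("great", "management", "benefits")
)

# Document frequency: in how many documents does each term appear?
doc_freq <- table(unlist(lapply(docs, unique)))

lower_thresh <- 1  # assumed threshold: keep terms found in more than 1 document
keep    <- names(doc_freq)[as.vector(doc_freq) > lower_thresh]
dropped <- setdiff(names(doc_freq), keep)

# Remove the dropped terms from every document
pruned <- lapply(docs, function(d) d[d %in% keep])

# Arithmetic check on the counts printed in the output above
terms_left  <- 6907 - 3442    # terms remaining after pruning
tokens_left <- 47718 - 3442   # tokens remaining after pruning

print(keep)
print(dropped)
cat(terms_left, tokens_left, "\n")
```

Roughly half of the vocabulary is removed here, but those terms account for a small share of the tokens, which is typical of rare-word pruning.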
topicPredictor = stm(documents = reviewPrep2$documents,
vocab = reviewPrep2$vocab, prevalence = ~ rating,
data = reviewPrep2$meta, K = 4, verbose = FALSE)
ratingEffect = estimateEffect(1:4 ~ rating, stmobj = topicPredictor,
metadata = reviewPrep2$meta)
summary(ratingEffect, topics = c(1:4))
##
## Call:
## estimateEffect(formula = 1:4 ~ rating, stmobj = topicPredictor,
## metadata = reviewPrep2$meta)
##
##
## Topic 1:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1160636 0.0028012 41.434 <2e-16 ***
## rating 0.0071497 0.0007593 9.417 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 2:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1689613 0.0032101 52.634 <2e-16 ***
## rating 0.0012009 0.0008664 1.386 0.166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 3:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.343535 0.003847 89.300 <2e-16 ***
## rating -0.001865 0.001073 -1.738 0.0824 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Topic 4:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.371533 0.003874 95.913 < 2e-16 ***
## rating -0.006511 0.001052 -6.191 7.6e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
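To get a feel for the size of these effects, the coefficients can be turned into expected topic proportions at particular ratings. The sketch below copies the estimates from the summary above and assumes ratings run from 1 to 5, which is not stated in the output. Because `estimateEffect()` summarizes simulated draws, treat the numbers as approximate.

```r
# Estimates and standard errors copied from the summary above
coefs <- data.frame(
  topic     = 1:4,
  intercept = c(0.1160636, 0.1689613, 0.343535, 0.371533),
  slope     = c(0.0071497, 0.0012009, -0.001865, -0.006511),
  se        = c(0.0007593, 0.0008664, 0.001073, 0.001052)
)

ratings <- c(1, 5)  # assumed endpoints of the rating scale

# Expected topic proportion = intercept + slope * rating
expected_prop <- outer(coefs$slope, ratings) + coefs$intercept
dimnames(expected_prop) <- list(paste0("Topic", coefs$topic),
                                paste0("rating", ratings))

# t value = estimate / standard error, as in the summary table
coefs$t_value <- coefs$slope / coefs$se

print(round(expected_prop, 3))
print(round(coefs$t_value, 2))
print(colSums(expected_prop))  # proportions across topics should sum to about 1
```

Under these assumptions, moving from the lowest to the highest rating shifts roughly three percentage points of expected content from topic 4 to topic 1, while topics 2 and 3 barely move.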
plot.estimateEffect(ratingEffect, "rating", method = "continuous",
                    model = topicPredictor, topics = 1, labeltype = "frex",
                    main = "Topic 1",
                    family = "Avenir Next")
We can see that the expected proportion of topic 1 rises with rating: the slope is small in absolute terms (about 0.007 per rating point) but clearly positive (p < 2e-16). This could mean that reviews that raise negatives only to offset them by building up the company are written by people who think very highly of the organization and want to raise its overall rating. Since they want to build up the company and offset any negative impressions of it, they would rate it considerably higher.
plot.estimateEffect(ratingEffect, "rating", method = "continuous",
                    model = topicPredictor, topics = 2, labeltype = "frex",
                    main = "Topic 2",
                    family = "Avenir Next")
In this case, many consulting companies and employees value their work in terms of projects and opportunities. So I would expect that the more the topic is discussed (i.e., the reviewer has more experiences to describe), the higher the rating is for the organization. The estimated slope is positive but small and not statistically significant (p = 0.166), so the evidence for this relationship is weak.
plot.estimateEffect(ratingEffect, "rating", method = "continuous",
                    model = topicPredictor, topics = 3, labeltype = "frex",
                    main = "Topic 3",
                    family = "Avenir Next")
This topic shows a negative relationship with organization rating, though only a marginal one (p = 0.082). As I discussed above, the topic generally covers negative experiences within the work environment. So, the more those experiences are talked about, the lower the rating is for the organization. The relationship shows more variation than the previous two, so some reviews in this topic could also be describing work-environment experiences that are less negative (or even positive).
plot.estimateEffect(ratingEffect, "rating", method = "continuous",
                    model = topicPredictor, topics = 4, labeltype = "frex",
                    main = "Topic 4",
                    family = "Avenir Next")
As expected, this relationship is also negative, and clearly so (p = 7.6e-10). The sample words shown carry very negative sentiment, and the sample reviews above place more emphasis and volume on the negative views within the topic. So, the more the topic is discussed, the lower the overall rating of the organization. There is also less variation in the results.
Submitted by Gabby Herrera-Lim.